[WSLowering] consumer release on each thread instead of master thread #10
remoteCTAId, false, 0);
auto arriveOp = builder.create<ttng::MBarrierArriveOp>(
loc, bufferEmpty, nullptr, nullptr, false, 0);
assert(op.getOperation()->hasAttr("async_task_id"));
I actually don't quite get this logic around remote-cta. It seems this change gets rid of the remote-cta mode.
Yes, this gets rid of remote-cta. Do you think it is still useful? I'm not sure how it would be used.
unsigned bufferEmptyCount = numCTAs;
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView, numCTAs);
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView,
THREADS_PER_TASK);
So this changes from a barrier across CTAs to a barrier within the warp group of 128 threads?
This changes the barrier from expecting a single arrival from the master thread of a WG to expecting an arrival from every thread in the WG.
Nice perf win!
Having each thread run the consumer release operation simplifies the logic by avoiding the master thread ID computation. It also seems to improve performance slightly.
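To make the arrival-count change concrete, here is a minimal host-side sketch of a barrier that completes a phase after a fixed number of arrivals. It is only a model under stated assumptions: `ModelBarrier`, `releaseFromAllThreads`, and `kThreadsPerTask` are hypothetical names, the value 128 is taken from the reviewer's "warp group of 128 threads" remark, and nothing here is the Triton lowering or the hardware mbarrier itself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of an arrival-count barrier.
// This is NOT the Triton or PTX mbarrier API; it only mimics the counting.
struct ModelBarrier {
  uint32_t expected = 0;  // arrivals required to complete one phase
  uint32_t pending = 0;   // arrivals still outstanding in the current phase
  uint32_t phase = 0;     // flips each time a phase completes

  void init(uint32_t count) {
    expected = count;
    pending = count;
    phase = 0;
  }

  // Record one arrival; returns true if this arrival completed the phase.
  bool arrive() {
    assert(pending > 0);
    if (--pending == 0) {
      pending = expected;  // re-arm for the next phase
      phase ^= 1u;
      return true;
    }
    return false;
  }
};

// Assumed warp-group size, following the "128 threads" review comment.
constexpr uint32_t kThreadsPerTask = 128;

// Models the new scheme: every thread of the consumer warp group arrives
// once, so no master thread ID has to be computed. Returns how many phases
// were completed by the 128 arrivals.
uint32_t releaseFromAllThreads(ModelBarrier &bar) {
  uint32_t completions = 0;
  for (uint32_t tid = 0; tid < kThreadsPerTask; ++tid)
    completions += bar.arrive() ? 1u : 0u;
  return completions;
}
```

In this model, a barrier initialized with `kThreadsPerTask` completes exactly one phase when all threads arrive, whereas a single arrival leaves it pending. That is why the expected count passed at initialization has to change together with who performs the arrival.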